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Abstract 

Recently there has been renewed interests in derivative free approaches to 
stochastic optimization. In this paper, we examine the rates of convergence 
for the Kiefer-Wolfowitz algorithm and the mirror descent algorithm, nnder 
various updating schemes using hnite differences as gradient approximations. 

It is shown that the convergence of these algorithms can be accelerated by con¬ 
trolling the implementation of the hnite differences. Particularly, it is shown 
that the rate can be increased to in general and to in Monte Carlo 
optimization for a broad class of problems, in the iteration number n. 
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1. Introduction. Let i? denote the set of real numbers. Consider a real-valued function 
J{6) of the form J{d) = Ex[L{X{6))] where 0 is a parameter, or a vector of parameters, 
L{X) is a real-valued function, and X{d) is a random variable that depends on 6. For 
simplicity, throughout this paper we assume that 6* is a scalar and 9 E Q G R, and X{9) 
is of the form X{6) = X{6,^), where is a random variable independent of 6. In such a 
formulation, X (6) is parameterized on an underlying probability space that is independent of 
6. For any two random variables rj and there exists a Borel function 0 such that rj = 
[Shiryayev (1984), p.l72]. Such a representation for X{9,^) is always possible. Therefore, 
J{d) can be written as J{6) = E^[L{X{d,^))]. We are particularly interested in hnding an 
optimal parameter 6* E Q to optimize, say minimize, J{9). This is a challenging problem 
since the analytical form of J{6) is usually unavailable for most problems of interest. What 
is obtainable are the sample measurements of the random value of L{X{d,^)). We have to 
use the information on L{X{6,^)) to find 6*. Such stochastic optimization problems can 
be found in many applications. The main approach to hnding the optimal solution is to 
successively approximate 9* via algorithms of stochastic approximation. This is a classical 
and standard approach that has been adopted in practice for decades. The Robbins-Monro 
(RM) algorithm, the Kiefer-Wolfowitz (KW) algorithm, and the relatively recent mirror 
descent (MD) algorithm are the most popular algorithms of this class. 

The RM algorithm, introduced by Robbins and Monro (1951), hnds 9* in the following 
way. Let 9o be selected and {a„} a sequence of positive numbers. For each integer n > 0, let 

(1) ^n+l ~ 9n anfjn 

where gn is an unbiased estimate of the derivative J'{9) of J{9) with respect to 9. Assume 
that J\9) exists on 0 and the variance of Qn is uniformly bounded for all n. Assume 
{9-9*)J'{9) >0for all9y9 9*, 

= oo, < oo, 

n n 

and that several other technical conditions are satished. Then {9n} converges to 9* with 
probability one. The convergence rate (in terms of root mean square error) is Note 

that this is the best possible rate of convergence for algorithms of the form (1) for stochastic 
optimization [see, e.g., Fabian (1971)]. 

The KW algorithm, introduced by Kiefer and Wolfowitz (1952), is a modihcation of the 
RM algorithm by approximating the gradient using a finite difference and hnds 9* recursively 
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by 

( 2 ) 


^n+l (^nhni 


where 

, L{X{e^ + ei,n)) - L{x{e^ - 6n, 6,n)) 

— rij: ’ 

2dfi 

{(5„} is a sequence of positive numbers, L{X{6n + 5n,Ci,n)) and L{X{6n — Sn,^ 2 ,n)) are two 
measurements of L{X{9,^)) at 9n + 5n and 9n — Sn, and ^i,n,^ 2 ,n are corresponding samples 
of Kiefer-Wolfowitz (1952) proved that if J(9) is decreasing for 9 < 9* and increasing for 
9 > 9*, and if 

Sn ^ 0, 'y ) Clfi CX3, y ) Ojn^n ^ OO^ ^ ) ^n/^ OO^ 


the sequence {9n} converges to 9* with probability one under some additional minor condi¬ 
tions. If all entries in are mutually independent, the best possible convergence rate for 

the KW algorithm (2) is which is achieved by choosing = an~^,6n = with 

a,d>0 constants [e.g. Burkholder (1956); Fabian (1971); Sacks (1958)]. The rate is 
regarded not satisfactory compared to the best possible rate for the RM algorithm. 

The MD algorithm, introduced by Nemirovski and Yudin (1983), improves the robustness 
of gradient based optimization algorithms. At iteration n > 0, 9n+i is updated via solving 


(4) 


9n+i = argmin^^e l<hn,9> +—D^{9, 6'„)|, 


where hn is an estimate of the derivative J'{9), H : 0 x 0 —)■ is a Bregman distance 

dehned as 

(5) D{9,t) := iIj{9) — < iIj'{t),9 — T >> k\\9 — 


where ip{.) is a distance generating function and k > 0 is a constant. In (5), ||.|| is a general 
norm on R™ (and on R in this paper). It has been established by Nemirovski et ah (2009) 
and Duchi et ah (2012,2013) that if J{9) is convex, Lipschitz continuous and 


Qin ^ 0 , y ] Qiji OO^ 

n 

1 ^ a- 

9n = - ^9i OI 9n = Vi = i = 1, 2,... 

a,' 

then J{9n) converges to the minimum of J{9) and the rate of convergence is under 

mild technical conditions that will be specihed in Section 3. 

The convergence of these algorithm is fairly understood [Burkholder (1956); Fabian 
(1971); Kushner and Clark (1978); Chung (1954); Dupac (1957); Dvoretzky (1956); Sacks 
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(1958); Wasan (1969), Nemirovski et al. (2009), Duchi et al. (2012, 2013)]. The conditions 
for the convergence of these algorithms can be made substantially weaker than those we have 
previously mentioned [e.g. Kushner and Clark (1978); Wasan (1969)]. The convergence rate 
for the RM algorithm is much faster than that for the KW algorithm. This is not surprising 
if we note that the KW algorithm uses the hnite difference hn as an approximation to the 
derivative J'{0), while the RM algorithm uses an unbiased estimate of J'{0). Therefore, the 
faster rate is achieved at the cost of obtaining an unbiased estimate of the derivative J'{0) 
that is often challenging in practice. On the other hand, although its convergence is slower, 
the KW algorithm requires no detailed information on the function J{9). It is simple to use 
and applicable to a wide range of problems. Kesten (1958) suggests that the stepsize be 
chosen according to the fluctuation in the signs of Qn and hn- A few of other techniques 
for the acceleration of stochastic approximation algorithms can be found in Wasan (1986). 
None of these accelerating techniques can improve the rate of convergence of the algorithms 
under study. 

In this paper, we are interested in the acceleration of the KW algorithm and the MD 
algorithm through controlling the estimation of the derivative using hnite differences. Fur¬ 
thermore, we consider the employment of the scheme of common random numbers (CRN) 
for improving the convergence of the algorithms — that is, the random factors and 
^ 2 ,n are chosen in such a manner that = ^ 2 ,n = ^n- Implementation of CRN in Monte 
Carlo optimization is rather straightforward. The term “Monte Carlo optimization” is used 
here to refer to the procedure of hnding the optimal solutions through computer simulation 
where the random factors, represented through a sequence of psuedo-random numbers, can 
be controlled [see Bratley et al. (1983)]. Computer simulation is often necessary when the 
form of L{X{9,^)) is too complicated. This is the case when L{X{9,^)) represents a per¬ 
formance measure of a stochastic system such as queueing systems, manufacturing systems, 
transportation systems, and communications networks [see, e.g. Ho and Cao (1991); Bratley 
et al. (1983); Law and Kelton (1982)]. In Section 6, we will give an example where the 
scheme of common random numbers is feasible. 

We use the term CRN in a much narrow sense. The term CRN has more general and 
sometimes ill-posed meaning than we intend in this paper [see Glasserman and Yao (1992)]. 
In this paper, CRN simply refers to that simulation experiments be performed with the 
same stream of random numbers. As far as the KW algorithm (2) or the MD algorithm 
(4) is concerned, CRN requires that estimates of J{9 + 5) and .J{9 — 5) be obtained from 
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simulation experiments using the same stream of random numbers {■Cn}- Let F{6,x) denote 
the distribution function of X{6,^). Then any experiments with hn constructed in the 
following form conform the CRN requirement: 

,,, -L(F2(^n,5n,en)) 

- 

where the marginal distributions of Yi{9n,Sn,^n) and Y2{9n,Sn,^n) are F{9n + Sn,x) and 
F{9n—Sn, x), respectively. Note that the joint distribution of Yi{9n, Sn, Cn) and Y2{9n, Sn, Cn) is 
left open, which may be used to improve the estimation variance. For a distribution function 
F{9,x), its inverse function is dehned as F~^{9,x) inf{M | F{9,u) > x}. Cambanis 
and Simons (1976) and Whitt (1976) proved that the variance of (6) is minimized when 
Yi(9n,Sn,(n) = F-\9n + Cn) and ^ 2 (6'n, ^n, Cn) = P-^iOn “ <5n,Cn)- In this paper we 
assume that the form of L{X{9,^)) is given and fixed. The term CRN merely refers to the 
special choice of = ^ 2 ,n- We will show that the use of CRN can signihcantly increase 
the rate of convergence for the KW algorithm (2) or the MD algorithm (4) from to 

at least n~‘^P. For a large class of functions, the rate can be increased to ^ the best 
possible rate for stochastic approximation algorithms. CRN increases the rate of convergence 
by reducing the variance of hn- Let Rar[X] denote the mathematical variance of a random 
variable X. Assume that Var[L{X{9,^))] is continuous in 9, is bounded from below by a 
positive constant and from above by a constant. Then if and ^ 2 ,n are independent, the 
variance of hn is 

{Var[L{X{9n + 5n,6,n))] + Var[L{X{9n - 5n, 6,n))])/(2<5n)' = 0{l/5l) 

which grows quadratically as 5n goes to zero. We say a variable /(s) = 0(s) if |/(<s)/s| < 
C, C > 0 is a constant independent of s (/(s) = o(s) if hm|/(s)/s| = 0 when s goes to 
zero or inhnity depending on the context). It is such a large variance of hn that slows 
down the convergence rate since, when 5n is suitably chosen, the rate would be if the 

variance of hn is bounded. As we will show later, the convergence rate for (2) depends on 
how fast the variance of hn goes to inhnity. The slower the variance goes to inhnity, the 
faster the convergence rate for (2) is. CRN has been observed ehective for variance reduction 
for decades. It is perhaps the most popular method for variance reduction [Bratley et ah 
(1983); Conway (1963); Fishman (1974); Hammersley and Handscomb (1964); Heikes et ah 
(1976); Kleijnen (1974); Law and Kelton (1982)]. 

The rest of the paper is arranged as follows: In Section 2 we examine the rates of 
convergence for the KW algorithm under a very general setting that covers many interesting 
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situations, the analysis is extended to the MD algorithms in Section 4. In Section 4 we 
show that the use of CRN can reduce the variance of by orders of magnitude, which in 
turn accelerates the convergence of the KW algorithm. In Section 5, we examine the rate 
of convergence for the MD algorithm under CRN. In Section 6 , we extend the results to 
multivariates. A practical example is given to illustrate the feasibility of applying CRN in 
practice. Finally, a summary is provided in Section 7. 

2. Rates of convergence for the KW algorithm. In this section, we examine the 
rates of convergence for the KW algorithm (2) under general assumptions on We do not 
assume that is of the form (3). We will see later that such a treatment covers several 
important cases. 

Assume that (5n > 0 goes to zero as n —)■ cxd and, for n > no > 1, hji satishes the following 
assumptions: 

(7) = J'iOn) + A,,, |AJ < 

and 

( 8 ) Var[hn\9n] < cS^, 

where b, c, P are real nonnegative numbers, 7 G R. The form of (7) assures that is an 
asymptotically unbiased estimate of J'{9) when > 0. When 7 > 0, the variance of the 
estimate goes to zero as n —)■ cxd. This is generally impossible in practice. When 7 = 0 such 
as in the RM algorithm, the variance is bounded. In the case that 7 < 0, e.g. 7 = —2 if 
hn is dehned by (3) and if and ^ 2 ,n are independent, the variance of hn goes to inhnity. 
Next we examine the convergence and the rate of convergence for the KW algorithm (2). 
The commonly used criterion for measuring the convergence of a stochastic sequence {9n} 
is the root mean square error (RMSE) defined as 

RMSEg^ = {E[{9^-9*f]y/^ 

If RMSEq^ = 0{n~^), s > 0, we say that {9n} converges at the rate of n~^ or the convergence 
rate for {9n} is n~^. 

We need the next lemma that was due to Chung (1954) and was formulated in the present 
form by Fabian (1971). 

Lemma 1. Let s, t, B, An, bn be real numbers, 0 < s < 1, t > 0, B > 0. Define b+ = 0 if 
s < 1 and b+ = t if s = 1 and assume that c = hm„^.oo An — &+ exists and is finite. If for 
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n > no 


B 


bn+l ^ ^n(l 


A-n 

^) + 


n- 


S-\-t 


and if c> 0, then 

lim supn*6n < B/c. 

n—>-oo 

The statement remains valid if all the inequalities are reversed and lim sup is replaced by 

lim inf. 

The following Theorems 1 and 2 give the convergence rate for the KW algorithm (2) with 
hn satisfying (7)-(8): 

Theorem 1. Assume that {9n} is determined by (2) and 
(A1). an = an~°‘, Sn = dn~^, 0 <a<l,ri>0, a, d>0; 

(A2). J{6) is increasing for 9 < 9* and decreasing for 9 > 9*, and there exist two constants 
Ki, K 2 , 0 < Ki < K 2 < 00 , such that for all 9 G Q, 


Ki\9-9*\ < \f{9)\ < K2\9-9* 


(A3), conditioned on 9n, hn at the nth iteration is independent of those at the other iterations. 
Then, if a = (1/2) min{Q; + 777 , 2[dri} and 0 < a < aKi, we have 

(9) lim sup — 9*Y] < C 

where C > is a constant. The convergence rate for RMSEq^ is at least n~'^. 

Proof. Without loss of generality, we assume that 9* = 0. Then 

BiCil = E[0l]-2a.,E%K\ + alE[hl] 

= E\el] - 2o„B[9„(J'(9„) + A„)] + al({E\h„]f + Var\h„]). 

According to (7)-(8), we have 

(10) i7[Ci] < E[9l] - 2anE[9nJ\9n)] + 2han5iE[\9n\] + 2al{E[J\9n)f + hHf) + caldl 
By Assumption (A2), 9nJ'{9n) > 0 and 

( 11 ) 9nJ\9n) > K,9l {J\9n)f < Kl9l. 
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Furthermore, for any e„ > 0, |6*„| < + Q\l^n and consequently 


Em]<i^ + -E\el]. 


By setting 0 < e < 1 and 


— 


2Mg 

ft'if’ 


we have 

( 12 ) 




Substituting (11) and (12) into (10), we obtain 


(13) BK+iI < BK1|1 - (2 - £)/fiO„ + 2Klal] + 2()=oJif + caji’ + 

According to (13), also noting Assumption (Al), we can choose an ni > no > 1 such that 
for all n> rii 

EieU < EielKi - ^) + 

where 

2 iF|a^ 2 r 2 72/9 2 I'v 

A„ = (2 — e)aiFi-, B = + ca^d^ H-—-. 

n" Kie 

If aKi > a, we can always choose e > 0 so small that (2 — e)aKi > 2a. Applying Lemma 1, 
we obtain (9) with C = B/{{2 — e)aiFi) if a < 1 and C = B/{{2 — e)aKi — 2a) if a = 1. I 


It follows directly from Theorem 1 that {On} converges to 6* as long as a > 0, or 
equivalently, as long as a + 777 > 0. When a + 77 < 0 which is possible only when 7 < 0, 
the variance of grows to inhnity at the rate of n* with t = —'yrj > a. It is obvious from 
(2) that {On} does not converge. Another extreme case is that 7 > 0. In this case, a can be 
made arbitrarily large by choosing appropriate rj. The convergence rate for {On} can be made 
arbitrarily large if rj can take any value. In fact, by setting 7 —)■ cx) in (13) (or equivalently, 
dn —t 0 ) and ttn = a such that 0 <g = l — (2 — e)K 2 a + 2 X^ 0 ^ < 1 for sufficiently large n, 
we have 

EVL,] < gEiell 

The convergence rate for the sequence {On} is that of a geometric progression. Unfortunately, 
this is a very special case. One should not expect 7 > 0 in practice. Both of the situations 
7 > 0 and q; + 77 < 0 are too special to deserve further study. The most interesting case is 
when 7 satishes —a/r] < 7 < 0. 
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Theorem 1 shows that, when hn satisfies (7)-(8), {On} converges with probability one to 
the optimal parameter 9* at a rate of at least n~^. We can further prove that {On} converges 
exactly at this rate as interpreted in the following Theorem 2. 

Theorem 2. Assume that Assumptions (A1)-(A3) are satisfied. 

1. Ifa + 'JT] < 2/37], aK 2 > a, E[hn\0n\ = J'{0n) + An, |A„| < Mf, and VarfifiOn] > c6/{, 
we have 

\\^m.in^^E[{0n -0*f]>Ci 

where Ci > is a eonstant. 

2. If a + 'yrj > 2/3rj, E[hn\0n] = J'{0n) + + Sn), and J'{0n) = {On - 0*){Ks + Tn), 

Sn = o(l) and Tn = o(l) uniformly as n ^ oo, = J"{0*) > 0, a < aK^, then 

lim sup — 0*] < —C 2 

where C 2 > 0 is a constant. 

Proof. Let consider the first statement. For simplicity and without loss of generality, 
we assume 0* = 0. Parallel to the derivation of (10) we have 

= E[0l] - 2anE[0n{f{0n) + A„)] + al{{E[hn]f + Var[hn]} 

which implies 

> E[0l] - 2anE[0nJ’{0n)] - 2ban6^E[\0n\] + caldl 

Assumption (A 2 ) implies that 0 < 0nJ'{0n) < ^ 26 ^^ which, together with ( 12 ) where Ki is 
replaced by K 2 , shows that 

452 

(14) sK+J > E[el](l - (2 - 0K,a„) + calSl - 

If CK + 7?7 < 2/37], there exists an no > 1 such that when n > uq 

_ ^!—n 0 > -ca'^S^ 

Therefore, we know from (14) that when n > no 

SK«1 > E[0l\O + 



Since aK 2 > cr, we can always choose e > 0 so small that (2 — e)aK 2 > 2a. The hrst 
statement of the theorem follows from applying Lemma 1 with Ci = ca?(P/{2{2 — e)aK 2 ) if 
a < 1 and Ci = ca?(P/ ( 2(2 — e)aK 2 — da) if a = 1 . 

If a + 7?7 > 2/3-1], we know that a = /3r] > 0 and 

E[en+i] = E[en]-anE[J’{en)]-baJ^{l + en) 

( 15 ) = E[9n]{l - Ksttn) + anE[TnOn\ - (1 + E[en\) 


Dehne Zn = n°'E[6n\. Then (15) shows that 

Zn+l = n‘^(l + ^)”"[^[6'„](1 - K^ttn) + anE[TnOn\ - 5a„5f(l + Sn)] 

= ^n[l H-- K^ttn + O^ — )] + (1 H- Y tE {anE[TnO r/\ — (1 + E[er/\))- 

n n n 


Denote 

C7^ LJ 1 

An = 1 H- K^ttn - K^ttn + 0{ — ), 

n n 

Bn = (1 + —Y^^ (bUnSnY + -^[^n]) ~ O-nE[Tn0n])- 

Then 

(16) Zn+l — AnZn Bn- 

Note that = an““,0 < a < l,Sn = dn~'^,aK^ > a. We may choose Ai,A 2 > 0,ni > 1 
such that, for all n>ni, 


(17) 


n" n n“ 



A 2 

n" 


Since Assumptions (Al)-(A3) in Theorem 1 are satished, lim„^oo sup < C which 

implies that 

lim supn°'|i7[6*J| < lim sup (n^'^i? [ 6 '^])^'^^ < \/C. 

n—>-oo '• n—>-oo l /tj/ 

According to the assumptions that = o(l), = o(l) uniformly as n 00 , and = 

d^n~^. There exists an n 2 > 1 such that, when n > n 2 , 

Bn = {l + ^''rE{han5i{l +E[en]) - anE[Tn9n]) 

> (1 + -YnYhandiY + E[en]) - a„^[|rJ]E[| 0 J]) 
n 

(18) > (1 + -Yrfhjandi > ]-rfbandY 

71/ Zi Z 
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Let no = max{ni,n 2 }. Then, from (16) we know that for all n > no 

n n—1 n 


(19) 


\ ^ ^ Bi Aj B^i 

i=nQ i=nQ j=i-\-\ 


Since 0 < o; < 1, (17) shows that 

n 

(20) 0 < ito n ^. < ito n (1 - ^)=O' 


" A^. 


n—^oo n—>-cx) - q*- 

i=no i=no 


Furthermore, hm„_j,oo-Bn = 0, and (17) and (18) imply that 


(21) n A>E^ n A>^E^r>^E(i- 4 )- 

i=no i=no j=2+l 

On the other hand. 


2 n" 

i=no 


2 n" 

i=no 


1 A. ( 1 

lim — E (1 —= I ' _A 
r,^r^ n“ n'^ 1 1 ~ e if a = 1- 


if 0 < a < 1 


1=110 


Substituting the preceding inequality, (20) and (21) into (19), we see that 


lim sup Zn < — O 2 

n—>-oo 

with O 2 = {l/2)ahd^{l — e~^^) > 0. This is exactly what we want to prove. I 

Theorems 1 and 2 show that the convergence rate for {dn\ is generally n~^. If we are 
free to choose the positive numbers a^rj, it follows directly from Theorem 1 that 

Corollary 1 . Assume that hn satisfies (7)-(8) and 7 < 0. Under Assumptions (A2)- 
(A3) in Theorem 1, the best possible convergence rate for the KW algorithm (2) is 
which is achieved by setting a = 1, p = l/(2/9 — 7 ), and by choosing appropriate a,d > 0. 

For the KW algorithm (2) with hn defined by (3), assume that J{6) is continuously 
differentiable of order up to three and the third order derivative J'''{9) is uniformly bounded 
on 0 , we have 

+ \r{e„)sl = + o(si) 

20n 0 

where 9n G [9n — dn, 9n + E- case, {3 = 2. If the assumptions ( 8 ) and (Al)-(A3) are 

satished and if the positive number a is chosen sufficiently large, we know from Theorems 1 
and 2 that 

1 

( 22 ) (T = - min{Q; + 77 , 47 }. 
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If we use the one-sided finite-difference approximation in (2): 


( 23 ) ^ L{X{9^ + Sn,^l,n))-L{X{9n,^2,n)) 

S2,n 

and if J{9) is twice continuously differentiable and the second order derivative J"{9) is 
bounded on 0 , then for any 9^ 5n there exists a 9n & [9n, 9n + (^n] such that 

E[h^\9] = - J{9n) ^ ^ \j"{9n)5n = J'{9^) + 0(5„). 

Therefore, /9 = 1. Under the same conditions as those in the previous case we know that 

(24) a = - min{a-|-yr], 27]}. 

It is clear from (22) and (24) that, under the same condition for the variance Uar[h„| 6 '„], 
the convergence rate of the KW algorithm is faster when symmetric differences are used 
than that when one-sided differences are used. Corollary 1 shows that the best possible 
convergence rate depends on two factors—how fast the bias decreases to zero and how slow 
the variance increases to inhnity. Using symmetric hnite difference (3) instead of the one¬ 
sided hnite difference (23) can reduce the bias of To summarize, we have the following 
conclusion which will be used later. 

Corollary 2. Suppose that (Al)-(A3) are satisfied. If 

(A4). J{9) is continuously differentiable of order up to three and the third order derivative 
.J'''{9) is bounded on 0, 

then the best possible convergence rate for the KW algorithm (2) with h^ defined in (3) is 
n-2/(4-7). If 

(A5). .J{9) is twice continuously differentiable and the second order derivative J''{9) is bounded 
on 0, 

then the best possible convergence rate for the KW algorithm (2) with hn defined in (23) is 
^-1/(2-7)_ 

3. Rates of convergence for the MD algorithm. The rate of convergence of the 
MD algorithms was established by Nemivoski et ah (2009) when the hn in (4) is an unbiased 
estimate of the derivative, and by Duchi et ah (2012, 2013) when the hn is approximated by 
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the one-sided finite difference (23). In this section, we examine the rate of convergence of the 
MD algorithm for general Again, we only assume that hn satisfies (7)-(8). For notational 
consistence, the norm ||.|| in (4) is taken as the I 2 norm. Its dual norm ||a;||* := sup||y||<]^ 
is also the I 2 norm. Define 

1 ” 

= — X! 


n 


2=1 


We next examine the convergence of J{0n)- 

Theorem 3. Assume that {6*„} is determined by (4), and 

(Bl). 'ipiO) is strongly convex, 0 is compact and convex, and there exists r > 0 such that 
D{e*, 6) < {l/2y, r>0 for all 6 G 0; 

(B2). L{X) is closed convex, and there exist two constants Ki,K2,0 < Ki < K 2 < 00 , such 
that for all 0 E O, 

Ky-0*i < \ j\e)\ < K2\e-e*\-, 

(B3). conditioned on On, hn at the nth iteration is independent of those at the other iterations. 


Then 

n n n n 

(25) ElJ{0„) - J(9-)] < El + ^ a,62 + tl ^ ^ , 

ly-i /~t rtn • * on • * on • * on • * 


na^ n 


2 = 1 


n 


2=1 


n 


2=1 


n 


2=1 


where 


^ _r^ ^_c Ky _b^ n _ 

re¬ 


proof. Under Assumptions (B1)-(B3), we know from Duchi et al. (2013), eqn. (13), 


that 

(26) 

Therefore, 


2 1 1 

J(9„) - J{0') < E - - E A.(('. - 0’ 


2na„ 2nK ~y 


n 


2=1 


2 1 ^ 1 

( 27 ) B[j(»„) - j(r)| < — + —Y^a,E[hl\ + -Y,E[\\(e,-e- 

^ i=l 


2nan 2nK 


The assumptions (7)-(8) give 

(28) E^] = Vary + {E[fy) + A,])^ < c6f + 2{E[fy)y + 2bHf 
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On the other hand, according to Assumptions (B1)-(B2), 


* m 2 ^ ^^2 Ji 


{E[j'{ei)]Y < {K^EWe, - e*\]y < 


(30) E[\Am-0*\]<-^. 

v 2 k 

By combining (28)-(30) with (27), we obtain 

2 r „ hr ^ „ 

Em,) - j(9*)] < _ + _ g ^ g _ g + _ g , 

which is exactly (25). I 

A special case of interest is 7 = 0, which corresponds to bounded variance of derivative 
estimation. The following Corollary 3 provides a bound on the convergence of the MD 
algorithm for this case with properly chosen {ttn}, {5n}- 

Corollary 3. Suppose that (A 4 ), (B1)-(B3) are satisfied. Let 7 = 0, a„ = 

Sn = dn~^, a > 0, d > 0. 

(a) For hn defined in (23), 


> + (2.5ftd^) “ + {c,d)t±Lm, 

\/n n n 


(31) E[J{en) - J(d*)] < (Cl + 2^2 + 2^3 


(h) For hn defined in (3), 

(32) E[Ji§n) - J{e*)] < (Cl + 2 C 2 + + (9C4dV7)- + (2C5d')-. 

^/n n n 

Proof. For 7 = 0, = dn~^, combining the first three terms in (25) 

gives the hrst term in (31). For dehned by the one-sided hnite difference (23), under 
Assumption (A4), we have 13 = 1. Consequently, the fourth term in (25) is 


C4 ” ^2 ^-40^2 


E 2 .-2.5 ^ 


•- 2.5 C4ad^2.5 


The last term is 


C 5 ” 


Csd ” ._3 C 5 d(l -f logn) 


'^0 r ' — 1 ^ 


^ iTi 




Combing the previous two inequalities with (25) gives (31). 
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For hn defined by the symmetric finite difference (3), under Assumption (A4), we have 
(3 = 2. Consequently, the fourth term in (25) is 


9± 

n 


i=l 


C^ad* 

n 


n 



i=l 


< 


C4ad\9/7) 

n 


The last term is 


n 


C,d^ ^ 2C,d^ 


n 


n 


i=l i=l 

Combing the previous two inequalities with (25) gives (32). I 


Duchi et ah (2013) investigated the convergence of the MD algorithm using the one-sided 
hnite difference (23) as an approximation to the derivative. The bound (31) is technically 
the same as that in Duchi et al. (2013). When the symmetric hnite difference (3) is used, the 
logn factor disappears in the last term of (32), which indicates that the symmetric hnite- 
difference approximation (3) leads to a tighter bound under similar assumptions, which is 
due to that the symmetric hnite diherence (3) typically provides more accurate estimate of 
the mean of the derivative than the one-sided ones do. Note that Duchi et al. (2012, 2013) 
implicitly assumes that CRN is used in calculating the hnite diherence (3) or (23) that will 
be covered in Sections 4 and 5. 

It’s worth of noting that the rate of convergence for the MD algorithm, as given by 
(25), is which is not ahected by the choice of hnite-diherence approximation, either 

symmetric or one-sided, to the derivative. This is by design since the MD algorithm was 
originally proposed for improving the robustness in the choice of stepsizes at the cost of 
slower convergence. 


When a hnite diherence is used to approximate the derivative, it is desirable to have 
—)■ 0 as n —)■ cx) to ensure asymptotically unbiased estimate of the derivative. In this case, 
it is possible (and likely in practice!) that the variance of the estimate goes to inhnity. This 
is a special case of (25) with 7 < 0. Therefore, Theorem 3 allows hexibility to cover general 
cases. 

A special situation is when L{X{9n + 5n,Ci,n)) and L{X{9n — dn,^ 2 ,n)) or L{X{9n, ^ 2 ,n)) 
in (3) or (23) are sampled independently. In this case, 7 = — 2 . Assume further that {an} 
and {(5„} are specihed as in Assumption (Al). Then the right hand side of (25) becomes 


/^/on /^n /^n /^n 

, ^2 , ^3 . r2/3 , ^5 


H{n) :=^ + ^^aA" + -E«* + -E«*C + -E'^; 


na 


n 


i=l 


n 


i=l 


n 


i=l 


n 


i=l 
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where 


/^/^n /^n /^n ^ n 

= ^ V + — E «*■“ + — E + — E 

Q^j-^l—a rn ^^ ri ^ ^ ^n ^^ 'n -^-—^ 


n 


n 


n 


n 


i=l i=l i=l i=l 

= 0(n-i+“) + 0(n-“+2^) + 0(n-“) + 0(n-“-^'^) + 0(n-^'^) 


= 0(n-n, 


a = min{l — a,a — 2ri, a,a + 2/3ri, I3r]} = min{l — a, a + 2ri, /3r]}. 

For the one-sided hnite difference (23), /9 = 1. Then 

a = niin{l — a,a + 2r], 77 } < 1/4. 

For the symmetric ffnite difference (3), /9 = 2. Then 

a = min{l — a,a + 2rj, 27]} < 1/3. 

The previous discussion can be summarized in the following Corollary 4. 

Corollary 4. Assume that Assumptions (Al), (A 4 ), (B1)-(B3) are satisfied, and that 
L{X{ 6 n + 5n-,^i,n)) and L{X{9n - Sn, ^ 2 , 11 )) in (3) (or L{X{ 6 n,^ 2 ,n)) in (23)) are independent. 
Then 

(i) the best possible rate of convergence for the upper bound H{n) is when the one¬ 

sided finite difference (23) is used, 

(a) the best possible rate of convergence for the upper bound H{n) is when the sym¬ 

metric finite difference (3) is used. 

Note that the rates of convergence are only upper bounds of E[J{9n)]. Such rates of 
convergence are consistent with those for {9n}. 

4. The KW algorithm with CRN. In this section, we will show how CRN can 
accelerate the convergence of the KW algorithm. For clarity and without getting trapped 
into unnecessary tediousness of details, we focus our attention on the case in which G i? is 
a real one-dimensional random variable. In Monte Carlo optimization, is usually a psuedo- 
random number generated by a computer. For most applications, such a pseudo-random 
number is sufficiently good to be regarded as a random number uniformly distributed on 
[0,1). In Section 6 , we extend the results to general situations. 
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To avoid repetition, we only consider the hn defined as in (3) with = ^ 2 ,n = Cn- 
The analysis is applicable to the one-sided finite-difference approximation (23) without any 
difficult. For a given On, hn is a ffnite-difference approximation to the derivative J'{0) at 
9 = On- For simplicity, we omit the subscript n. Then 

(33) h =- - -, 

The mean of h is 

26 

which is the same as that of (3) without the use of CRN. However, the variance of (33), as 
we will show, is generally smaller than that of (3) without the use of CRN when <5 > 0 is 
sufficiently small. We will also show that the reduction in the variance of h may have a sig¬ 
nificant impact on the convergence rate for the KW algorithm for Monte Carlo optimization. 
Toward that end, we need to specify the generation of the random variable X{d,^) with a 
given distribution F{0,x). Next we examine the variance of (33) for several popular random 
number generation methods. Note that 

Varlh] = + i.O) - L{X{B - i,0))"] + (■](» + S) - J(« “ 

If J{d) is continuously differentiable on 0 with bounded derivatives, then 

(34) Varlh] = ^B|(L(A'(« + i,?)) - L{X(B - i,^)))^] + 0(1). 

4-1. Inversion method. Inversion is one of the most popular methods for random 
variable generation. Let F{0,x) be the distribution function of X{0,^). The inversion 
method generates the random variable X{0,^) in the following way: 

1 . Generate a random number ^ uniformly distributed on [0,1). 

2 . Setx{e,o = F-Ho,0- 

Then it is straightforward to verify that X{0,^) has the desired distribution. Note that the 
mapping F{6,x) : i? —)■ i? is not one to one in general. To ensure its existence for general 
distribution functions, the inverse function is defined as 

F~^{9,^) = minja; | F{9,x) > ^, x E R} 
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which is different from the usual definition [see Krantz (1991)]. It coincides with the usual 
definition if F{d,x) is continuous and strictly increasing. Such a definition of the inverse 
function covers both continuous and discrete random variables. For example, consider a 
discrete random variable X{9,^) = Xi with probability Pi{0). Define po{9) = 0,pi{9) = 
for i > 1. Let ^ be uniformly distributed on [0,1). The inversion method gives 
= a;* if G [pi-i{9), pi{9)). Then direct verification shows that X{9,^) obeys the 
desired distribution. This is a discrete version of the inversion method. 

In order to proceed with our discussion, let us first examine the properties of distribu¬ 
tion functions. A distribution F{9,x) is a nondecreasing and right-continuous function of 
X. F{9,x) has at most countably many points of discontinuity on R and all of the dis¬ 
continuities are of the first kind — that is, for any x E R, F{9,x~) = limy^x F{9, y) and 
F{9,x~^) = \imyixF{9,y) exist and are finite [e.g. Krantz (1991),149-150]. Therefore, we 
can divide R into \JiBi{9) = R, where Bi{9) = [bi{9),bi+i{9)), such that, for each i, F{9,x) 
is continuous on Bi{9), but jumps at bi{9). Assume that, for each i, F{9,x) is piecewise 
differentiable on Bi{9). Then F^{9,x) > 0 whenever it exists. We further divide the interval 
Bi{9) into subintervals according to whether the derivative of F{9,x) with respect to x is 
zero or not. For simplicity, we assume that Bi{9) = B^{9) {jBf{9) such that F^{9,x) = 0 on 
Fi{9) = [bi{9) , Ci{9)] and F{9,x) = Fi{9,x) is continuously differentiable with strictly posi¬ 
tive derivatives on B^{9) = (cj(6*), &j+i(6*)). It is possible that bi{9) = Ci{9). On B'^{9), the 
derivatives F^{9,x) should be understood as the right and the left derivatives at bi{9),Ci{9), 
respectively. It is possible that F{9,x) is not differentiable at Ci{9). The inverse F~^{9,^) 
is defined in the usual sense. It is continuous, strictly increasing, and differentiable on 

{F{9,c0)),F{9,b^,{9)))- 

Under the preceding decomposition, F{9,x) is discontinuous at bi{9), is a constant on 
B^{9), and is strictly increasing and differentiable on B^{9). 

The following Lemma 2 follows directly from the definition of the inverse function and 
the decomposition of F{9,x). 

Lemma 2. Let X{9,^) be defined by the inverse function X{9,f) = F~^{9,f). Let 
Ei{9) = [F{9,b~{9)),F{9,b~^fi9))). ThenEfid) C [0,1) and for any f E Efid) 

r bfi9), ^ffE[F{9,b-{9)),F{9,cfi9))), 

(35) cfi9), ^ff = F{9,cfi9)), 

[ F-\9,0, ^f^E{F{9,cfi9)),F{9,b-,fi9))). 
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We need the following result. 

Lemma 3. Assume that 

(Cl). L{X) and L'x{X) are bounded, J{ 6 ) is continuously differentiable on 0; 

(C2). for each i, FfO^x) is continuously differentiable on Bf{9) with strictly positive deriva¬ 
tives with respect to x, and 

Y,E[{max{F’g{e,x)y/F’^{e,x))Is+(^Q-^] < oo; 

i 

(C3). hi{ 6 ) is continuously differentiable in 9, and J 2 imax 0 {b'f 9 )Y < oo; 

(Cf). for each i, the functions F{9,Ci{9)) and F{9,b~{9)) are continuously differentiable in 
9, andJ2i^^^e\F'{9,Ci{9))\ < oo, |F'(6*, 6“(6*))| < oo, 

Define Mi{9) = 2J2i{L{ci{9)) — L{bi{9))fi\F'{9,Ci{9))\. Then Mi{9) > 0 is bounded for all 
9. If Ml {9) > 0, we have 

(36) E[{LiXi9 + 6, 0) - L{Xi9 - 6, 0))1 = Mii9)6 + o(<5) 

as (5 > 0 goes to zero. 

Proof. We calculate 

Urn fimxio + 6,0)- L(X(e - o))"'! 

S^o 0 

= Urn Wi5|(L(A'(« + i.O) - L(X(0 - ?)))%.(«)]■ 

i 

Let 

R,(0,S) = jE[(L{X(0 + i,«)) - L(X(0 - SX)) fh.i»-s)V 

Then 

(37) lmjiB|(L(A'(« + 5,0) - L(A(» -5, {)))"] = 

i 

Next, we prove that the limit and the summation commute. Define Di i = Sj(6'—5) n[0, E{9f 
b, b-{9 + 5))), A, 2 = Efi9 - 5) n Efi9 + 5), and A, 3 = Ef9 - b) n[E{9 + b, 6 -+i(^ + 5)), !)• It 
is possible for each of Dij,j = 1, 2, 3, to be empty. Then Ei{9 — 5) = U Eij and 

(38) Rfi9,b) = E Aj, R^,J = ^E[{L{X{9 + 5,0) - L{X{9 - 5,0))'A,,]. 

.j=i " 
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By Assumption (Cl), there exist A^i,A "2 > 0 such that \L{X)\ < Ni, \L'x{X)\ < N 2 . 
Therefore, 

R^,l < {2N,y\F{e + 5,b-{e + 5))-F{e-5,b-{e-5))\/5 
< 2 { 2 N,fmax\F'{e,b-{e))l 

u 

< {2N,f\F{e + 6,b-^,{e + 6))-F{e-6,b-,,{e-6))\/6 

< 2{2N,rme.x\F'{e,b-^,{e))\, 

Without loss of generality, assume that F{d + S,b~{d + S))) > F {6 — 6 ,b~{d — S))) and 

F{e + 5,+ 5))) < F{e - 5, b-+,ie - 6 ))). If F{e + s, die + 6 )) > Fie - 6 , die - s)), 

1 .Fie-SMd-s)) 

(39) = -J iLixie + s,0)-Lixie-6,0)M 

0 JF(e+5,b-(e+S)) 


1 rF{e+S,Ci{d+S)) 
6 JF{e-s,ci{e-s)) 

6 JF{e+5,ci{e+5)) 


iLixie + 5,0)-Lixie-5,0)Yd^ 


iLixie + 6,0)-Lixie-6,0)M 


1 rF{e-5,Ci{0-5)) 

6 JF{e+s,b-ie+s)))' 

1 .F(0+5,Ci(0+5)) 

6 JF{e-5,ci{e-&)) 

1 .F{e+5,b-^^{e+5)) 

5 JF{e+5,ci{e+5)) 


iLihie + 5)) - Lihie - 5))fdi 
iLihie + 5))-LiF-\e-5,i))fdi 
iLiF-\e + <5,0) - LiF-\e - 5, i))fdi 
The first two terms of (39) are bounded respectively by 


4A^22 i2Nif m.&x\F'ie,die))\- 

6 6 

The third term of (39) can be rewritten as 

1 fF(e,ci(e)) 

T / iLiF-\e + 5,i))-LiF-\e-5,i))fdi 

0 JF(e+&,ci{e+&)) 

1 i.F(e+5,b- (e+s)) 

+-/ ; iLiF-\e + 5,i))-LiF-\e-5m"di 

1 rF{S,b-{S)) 

+T I " iLiF-\e + 6,0)-LiF-\e-6,0)M 

d JF(9,ci(e)) 

< 2 i 2 Nif max |F'(0, die))\ + 2(2W)^ max |F'(0, 6"+i(0))| 

6 0 
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+4iV|£|(max(f;;(9, 

Therefore, Ri ^2 is bounded by 

4N^ max{b[{e)fS + 4{2N^f max \F'{e, Ci(0))| 

9 6 

+2(2Af,)" max |F(«,(,-„(»))!+ 4/V2"i5|(max(Fi(«,i))VF4(«,i))/sa,„l|4 
Similarly, we can prove that if F{6 + 6, Ci{6 + 5)) < F{6 — 6, Ci{6 — 5)), 

1 fFi9-\-S,Ci(9-\-S) 

(40) i ?,,2 = J {L{k{e + 6)) - L{k{e - S))M 

d JF{e+s,b-(e+s))) 

1 fF(e-s,ci{e-5)) 

+ -/ {L{F-\e + 5,i))-L{h{e-5))fdi 

0 JF(6+&,Ci{e+&)) 

1 rF(e+s,b-(e+5))) 

+ -J {L{F-\e + 6,0)-L{F-\9-5,0)rd^ 

0 JF{e-5,ci{e-b)) 

< 4NI max(6'(6*))^5 + 4(2iVi)^ max \F'{9, Ci{6))\ 

9 9 

+ 2(2iVi)2max|F'(0,6-+i(0))| 

9 

+ x)f/F'je, 

Substituting the upper bounds for Rij,j = 1,2,3, into (37), also noting the assumptions 
(C2)-(C4), we see that Ri{9, 5) is uniformly bounded with respect to S. Therefore, J2i Ri{9, 5) 
converges uniformly in (0, (5o) for any 5o > 0. By the Weierstrass M-test [Krantz (1991),211], 
we know that the limit and the summation commute. From (37), 

(41) linj \E\(L(X(e + 4,0) - L(X{0 - S, 0))=] = Jim Z «.(». 4) = S Jim 4). 

i i 

We next calculate lim^^o Ri{9, 5). For each i, there exists a <5* > 0 such that for any 5 < 5i 

\F{9 + 5, b-{9 + 5))) - F{9 - 5, b-{9 - 5)))| 

<1 . mm {F(»-4,4-+,(»-4))-F(«-4,4-(»-4))}, 

4 ^=2—1,2,2+1,2+2 

Note that Di^i = 'Ei{9 — (5)nSj_i(6' + 5) and = '^i{9 — (5)nSi+i(6' + 6) when <5 < 6i. 
Therefore, by taking into account that each of Di i and 71*^3 may be empty, we have 

I^F{e+5,b- ( 0 +< 5 )) 

R.,, < / (L{h{9 + 5))-L{F-\{9-5,0)M 

JF{e-5,b-{e-5)) 

< max \F'i9, 6-(0))|(L(6,(0 + 5)) - L(F;+(0 - <5, f)))^ 

= 0(1) 
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where f G [F{d — S,bi {6—S)),F{d+S,bi (6*+(5))). Similarly, Ri^s = o(l). Hence, lim^^o-Djj = 
0 for j = 1, 3. Also, the analysis of (39) and (40) shows that 

(42) hm A ,2 = 2 (L(q(0 )) - L{k{e))Y\F\e,c,{e))\ 

0—>-0 

Substituting (42) into (41) we get 

lim i£|(L(X(9 + i,«)) - L{X{0 - = M,(9) 

(J-j-o d 

which is exactly what we want to prove. I 

The proof of Lemma 3 shows that Assumptions (C2) and (C3) guarantee that the inverse 
function F~^{9,^) is sufficiently smooth. Assumption (C4) ensures the existence of Mi{9). 
Assumptions (C2)-(C4) are mild. Assumption (Cl) guarantees the smoothness of the func¬ 
tion L{X). The boundedness of L{X) and L(^(X) can be removed if there are only a hnite 
number of sets of Bi{9). The hniteness of Bi{9) can also relax the assumptions (C2)-(C4). 

The case of Mi{9) = 0 can only occur when either bi{9) = Ci{9) or F'{9,Ci{9)) = 0. The 
situation of bi{9) = Ci{9) (assuming that L{X) is not a constant) happens when F{9,x) is 
strictly increasing. A repetition of the proof of Lemma 3 yields that 

Corollary 5. If Assumption (Cl) is satisfied and 

(43) El(Fi(e,x)y-/F^(e,x)]<oo, 

then EmX(9 + 0) - i(X(« - {)))2| = O(h^). 

Corollary 5 recovers a result obtained by Glasserman and Yao (1992) under the assump¬ 
tion of Lipschitz continuity of L{F~^{9,f)). When F'{9,Ci{9)) = 0 for all i, using the same 
arguments as that of Corollary 5 we can establish that 

Corollary 6. In addition to Assumptions (C1)-(C4), assume that F{9,Ci{9)) is eon- 
tinuously twiee differentiable for all i with 

(44) 0 < YxiHcm - L{bfi9))fi\F''{9,cfi9))\ < oo. 

i 

Then, E[{L{X{9 + 5,0) - L{X{9 - 5, 0))'] = 0{5^). 

The following Theorem 4 is the main conclusion of this subsection. 
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Theorem 4. Assume that Assumptions (Al)-(A4) and (C1)-(C4) are satisfied. If 
Mi{ 6 ) > 0 for all 9, then the best convergence rate for the KW algorithm (2) with hn defined 
by (33) is . This rate is attained by choosing a„ = an~^, a > 2/{5Ki), and 6 n = . 

Proof. Under Assumption of (C1)-(C4) and Mfid) > 0, we know from Lemma 3 that 
E[{L{X{9 + 5,0) “ L{X{9 — 5,0))^] = Mi{9)6 + o{ 6 ). According to (34), the variance of 
hn is of order Var[hn\0n\ = Mi{6n)/5n + o{l/5n). Lemma 3 shows that MfiO) is bounded. 
Therefore, 7 = —1 in ( 8 ). Since (Al)-(A4) are satished. Corollary 2 shows that the best 
convergence rate is . I 

The following Theorem 5 summerizes the rate of convergence of the KW algorithm (2) 
when (3) is replaced with one-sided hnite difference approximation with CRN. The proofs 
are omitted since they are very similar to that of Theorem 4. 

Theorem 5. (I) Under the same conditions as those of Theorem 4 but the estimate h 
is replaced by the following one-sided finite difference with the use of CRN 

( 45 ) ^ ^ HX(,e + i,!;))-nx(B,!;)) ^ 

6 ’ 

the best convergence rate is which is achieved by setting a„ = an~^,a > l/3Ki, and 

6n = dn~^^^. 

(II) Assume all the assumptions of Theorem 4 except that MfiO) = 0. Then Corollaries 5 
and 6 show that E[{L{X{9 — L{X {6 — = 0(5^) if either of (43) or (44) holds. 

Hence, Var[hn\0n] = 0(1) forhn defined by either (33) or (45). The best convergence rate for 
the KW algorithm (2) is j'hig 5 g attained by setting a„ = an~^,a > l/2Ki, 

and 6 n = dn~^,rj > 1 / 2 . 

We would like to emphasize that the assumptions in Corollaries 5 and 6 are satished 
for a broad class of stochastic optimization problems [see Glasserman and Yao (1992) for 
a discussion]. Theorems 4 and 5 state that, when the inversion method is used in the 
generation of random variables and when h is dehned by (33), the convergence rate for the 
KW algorithm with CRN is general and is for a large class of problems that 

satisfy the assumptions in Corollaries 5 and 6. The improvement is signhcant since the best 
possible rate for the same KW algorithm without CRN is 

4 . 2 . Rejection method. Let f{ 6 ,x) be the density function of X{ 6 ,f). Assume that, for 
all 6 * G 0 , f{ 0 ,x) is zero outside a hnite interval [a, 6 ] and is bounded by 0 < f{ 0 ,x) < c. 
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c > 0 is a constant. The rejection method generates X{6, according to the following three 
steps: 

1 . Generate uniformly distributed on [a,b]. 

2 . Generate ^2 uniformly distributed on [0,c]. 

3. If ^2 < then set X{9,^) = otherwise go to 1 . 

The rejection method uses at least two random numbers .^1 and .^2 to generate X{d,^). The 
total number of random numbers ^ 1 , ^2 required before outputing X{d, is a random value. 
The rejection method does not accurately meet the GRN requirements since it is impossible 
to dehne X{6 + 6,^) and X{d — S,^) using a hxed set of uniform random numbers [Bratley et 
ah (1983); Franta (1975)]. Therefore, we modify the dehnition of GRN in the sense dehned 
by the following procedure for the generation of a paired random variables: 

Generation of X{6 + 6, and X{6 — 

1 . Generate .^1 uniformly distributed on [a,b]. 

2 . Generate ^2 uniformly distributed on [0,c]. 

3. If 6 < - 6,6 ) and 6 < 1(9 + S, 6), then set X{9 -6,0= X{9 + 6,0 = 0- 

4. If 0 < /(^ ~ and 0 > f{d + 6,0): then set X( 6 * — 5, .^) = .^1 and generate a 
X( 6 * + (5, ^) = ,^3 by the rejection method. 

5. If 0 > fi9 — 5,.^i) and 0 < fi9 + 5,.^i), then set X( 6 * + 5,.^) = .^1 and generate a 
X{9 — 6,0 = 0 by the rejection method. 

6 . If 0 > /(^ - 6, 6 ) and ^2 > f{9 + 6, ^ 1 ), go to 1 . 

This is essentially a coupling procedure [see Devroye (1990) for a discussion on coupling]. 
Such a modihcation is necessary to mimic the scheme of GRN using the rejection method. 
We will soon see that even such a loosely dehned scheme can accelerate the convergence of 
the KW algorithm. Let X{9 — 6,0:X{9 + 6,0 be generated by the preceding procedure. It 
is obvious that E[h] for h in (33) remains the same as that in the inversion method. 

Theorem 6. Suppose that f{9,x) is zero outside [a,b], 0 < f{9,x) < c for all x G [a,b], 
and X{9 — 6,0: X{9 + 6,0 ore generated by the previously deseribed proeedure. Assume that 
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(HI). Var[L{X[6.,(())] is continuous in 9 e 0; 


(H2). f{0,x) is differentiable in 6 for each x G [a, 6], f{0,x) satisfies the Lipschitz condition 
with respect to 6, i.e., there is a K{x) such that \ f{9 + x) — f{9,x)\ < K{x)6, and 
that Jl)K{x)dx < oo. 


Define 


M2{9) = 


Var[L{X{9,m 


\fl){e,x)\dx. 


2 c{b — a) 

Then 0 < M2{9) < oo. If M2{9) > 0 for all 9, Var[L{X{9,f,))] is bounded, h is defined 
by (33), and the assumptions (Al)-(Af) are satisfied, then the convergence rate for the KW 
algorithm with CRN is 

Proof. We see from the procedure of generating X{9 — 6,(,) and X{9 + 6,^) that, 
conditioned on either ,^2 < f{9 — 5, ■Ci) or (^2 < f{9 + 5, ^ 1 ), X{9 + 5,0 = ~ 5,0 = 0 

when 0 < /(^ ~ 5,0) and 0 < /(^ + 5,0)j otherwise X(6' + 5,0 = Cs and X{9 — = 0- 

Note that 0 and 0 are independent. Therefore, 


(46) Var\h] = +(l/or|L(&)] + rarlL({4)])+ll/(» + ^.?i) - f(» - 

4o^ c 

Under Assumption (HI), Uar[L(0)] + Uar[L(0)] = 2Uar[L(X(6*, 0)]+o(l). By Assumption 

(H2), 

H\\f(0 + 6) - /(9 - 6)11 < /‘ K(x)dx 

0 0 — Qj Ja 

and K{x) is integrable on [a,b]. According to the Weierstrass M-test, (46) implies that 

''“’'['‘1 = ^ rar | L ( A '( 9 , 0 ) l +| l/(9 + 66 ) 6 ) 11 + 0(0 

= ll/ar[L(A'(9,{))l^(jO) / + “0 

Thus, we know from Corollary 2 where 7 = — 1 that the conclusion follows. I 


For simplicity, we only consider the simplest form of the rejection method and the case 
in which f{9, x) is continuous. An analysis similar to the one used in the proof of Theorem 6 
shows that Var[h] = 0(1/5) remains valid in the following three situations: (i) The estimate 
h is replaced by the one-sided hnite difference (45); (ii) The density function f{9,x) is 
piecewise differentiable; (hi) The rejection method is replaced by the following generalized 
rejection method. Assume that there exist a positive constant A and a density function g{x) 
such that f{9, x) < Ag{x) for all 9 and for all x G [a, b]. Then 
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1. generate with the density function g{x)-^ 

2. generate ^2 uniformly distributed on [0,745f(,^i)]; 

3. if ^2 < then set X{6,^) = ,^ 1 ; otherwise go to 1. 

It is easy to verify that X{9,^) has the desired distribution. The density function g{x) 
should be chosen such that it is easier to generate a random variable with g{x) than those 
with f{e,x). 

Generally speaking, the convergence rates for the KW algorithm are the same when either 
the inversion method or the rejection method is used in the generation of the random variable 
X{6,^). However, the rate corresponding to the use of the rejection method is universally 
true for any function: It can be seen from its dehnition that M2{0) is always positive except 
when Var[L{X)] = 0 or when f{0,x) is independent of 6. Both cases are of little practical 
relevance. Furthermore, assume that the assumptions in Theorem 6 are satisfied and, in 
addition, f{9,x) is strictly positive on (a, 6). Then the best possible convergence rate for 
the KW algorithm is when the rejection method is used in generating X{9, ^). On the 
other hand, the distribution function F{9,x) = f {9, u)du is continuously differentiable 
and strictly increasing on [a,b]. Theorem 5 shows that the convergence rate for the KW 
algorithm is if the inversion method is used in generating X(9,^). Therefore, as far 

as the convergence of the KW algorithm is concerned, the inversion method leads to faster 
convergence than the rejection method. This conclusion is in favor of the argument that the 
inversion method is superior to the rejection method [c.f. Bratley et ah (1983), 141]. 

4-3. Composition method. Assume that the distribution function F{9,x) of X{9,^) is 
of the form 

m 

F{9,x)=Y.P0)F,{9,x) 

i=l 

where Pi{9) > 0,m < 00 , J2iPi{9) = 1, and for each i, Fi{9, x) is a distribution function. The 
composition method generates the random variable X{9,^) in the following way: 

1. Generate a random variable Y with distribution Proh{Y = i\ = Pi{9). 

2. If K = i, generate X{9,^) according to distribution Fi{9,x). 

In the composition method, there is no specihcation on the method for the generation 
of random variables at each step. Any method such as inversion and rejection can be used. 
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As an example, we consider the case in which random variables are generated nsing the 
inversion method which is snperior to the rejection method, as we have argued in the previous 
subsection. Dehne po(^) = 0,pj(6*) = for i > 1. The following procedure is the 

actual composition method we are considering. 

1. Generate a random number uniformly distributed on [0,1). 

2. If ,^1 G [pi-i{6), pi{6)), then generate a random number ^2 uniform on [0,1) and set 

x{e,i) = F-\e,i2). 

In this algorithm, we need two uniform random numbers in the generation of X{d,^). 
Actually we can do with only one random number by setting ^2 = (Ci ~ Pj-i(^))/Pi(^)- 
Direct verihcation shows that, conditional on ,^1 G [pi-i{6) , pi{6)) , ^2 is uniform on [0,1). 
In the composition method, we regard that X{6 — S,^) and X{d + S,^) conform the CRN 
requirement if they are generated by the preceding procedure using the same ^ = {^ 1 ,^ 2 )- 
We can prove that it can accelerate the convergence of the KW algorithm. 

For simplicity, we assume that, for each i, Fi{9,x) = Fi{x) is independent of 6*, and the 
number of distribution component is hnite, i.e., m < 00 . The case in which F{6,x) is of 
general form can be treated parallel to the proof of Theorem 4. Our aim here is to find special 
features of the decomposition method rather than to develop the complete theory which is 
not difficult to derive. We hrst consider the situation where .^1 and .^2 are independent. We 
then examine the case where ^2 = (■Ci ~ P*-i(^))/p*(^)- 

Theorem 7. Suppose that ,^1 and ^2 o^re independent, Pi{6) is differentiable in 9, and 

m 

huo) ^Y.EmFpM)) - i(Fr‘K)))tip;(«)i 

i=l 

exits and is finite. If Mfi9) > 0 is bounded from above for all 9, h is defined by (33), and 
the assumptions (Al)-(Af) are satisfied, then the convergenee rate for the KW algorithm is 

Proof. The proof is a simplihed version of that of Lemma 3 since the assumptions here 
are stronger. According to the generation scheme oi X{9 + S, and X(9 — we have 

(47) El(L(X(e + 4,0) - L(X(e - 4,0))"] 

"7 fPi{e-s) 

= T. , „ + 4,0) - MX(D - 
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Note that m is finite and, for each i, pi{0) is continuous. There exists a (5o such that when 

S <So 

(48) \pi{6 + (5) - pi{d - 5)1 < ^ nfin{pj+i(6' - 5) - pj{9 - 5)}, for all i. 

Let us first consider the case in which pi-i{9 + 5) > pi-i{9 — 5) and pi{9 + 5) < Pi{9 — 5). 
When 5 < 5o; (48) ensures that (47) can be rewritten as 

i=l Jpi-ii9-S) 

™ rPi{e+s) 
j_l Jpi-i{9+S) 
m fPi{e-5) 

+ Y / £'fe|(L(A(« + i,0)-i(A(»-i,0))"]<i6 

Jpi{e+&) 

= E / , - r(fr'( 6 )))V 6 

i=i 9pi-i(e-5) 
iTi rpi{e-5) 

+ E/, , 

Jpi{0+5) 

m 

i=l 

m 

+1: Fl(L(FpiK)) - r(F-‘«)))>'(«) 2 i + o(i) 

i=l 

By considering every case ofpj_i(6' + 5) <Pi-\{9 — 5) andpi(6* + 5) < Pi{9 — 5), pi-i{9 + 5) > 
Pi-i{9 — 5) and pi{9 + 5) > pi{9 — 5), and Pi-i{9 + 5) < Pi-i{9 — 5) and pi{9 + 5) > pi{9 — 5), 
we obtain that 

B|(L(A'(9 + i,«)) - L(A(9 - = M3(9)^ + o(^). 

It follows from (34) that Var[h] = {l/A)M^{9)/5 + o(l/5). Applying Corollary 2, it is easy 
to see that the convergence rate for the KW algorithm is . We have thus completed 
the proof. I 

Similarly, we can prove that the rate remains valid for the case in which the random 
variable X{9,^) = F~^{9,^2) is generated by setting ^2 = (■Ci ~ Pi-i{(^))/ P0) ■ 

Theorem 8. Suppose that Pi{9) > 0 is differentiable and the following funetion exists 
and is finite for all 9: 

m 

Am = i:[L(F.;‘,(l-)) - F(F-‘(0+))]>'(»)|, 
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where (,^) and ^( 0 - > 0 is bounded for 

all 9 and (Al)-(A4) are satisfied, then the convergence rate for the KW algorithm is . 


The previous Theorems 7 and 8 show that the convergence rate for the KW algorithm 
is when the composition method is used. This rate does not depend on how many 

random numbers are used in the generation of random variables. We would emphasize that 
it is unlikely for each of Mfi 6 ), MfiO) to be zero in practice. 


5. The MD algorithm with CRN. In this section, we examine the rates of con¬ 
vergence for the MD algorithm under CRN. As shown in the previous section, the use of 
CRN largely affects E[hff\ and thus the reduction of the variance Var[hfi\- The analysis in 
the previous Section 4 provides direct information on E[hff\- Therefore, in this section we 
directly work E[hff\ without going through Var[hfi\- We may represent E[h‘fi\ in the following 
form. 

(49) E[hl] < 161 

Recall that 7 = —2 for independent samplings of X{ 6 ,f) without CRN. With CRN, 7 = —1 
if Mi{ 6 ) > 0 and 7 = 0 if MfiO) = 0. By following the same arguments as in the proof of 
Theorem 3 and applying (49) directly for E[hf\ in (27), we obtain the following theorem. 

Theorem 9. Assume (B1)-(B3) and (49). Then we have 

^ ^ n ^ n 

(50) E\.j(e,.) - J(r)] < -!- + ^ + ^1:7, 

where C 2 = c/{2k), Ci and C 5 are specified in Theorem 3. 

The following Corollary 7 summarizes the rates of convergence for the MD algorithm 
with using CRN in calculating the hnite difference (33) and (45). 


Corollary 7. Assume (B1)-(B3) and (49). Denote 

^ n n 

A = —+ -E“.A' + -E'57 


TIOjyi ^ 2 = 1 


n T=1 


Then 


(i) if'^ = —1, the best possible rate of convergence for the upper bound H{n) is n when 
the one-sided finite difference (45) is used, 

(a) if'y = —1, the best possible rate of convergence for the upper bound H{n) is when 
the symmetric finite difference (33) is used. 



(in) i /7 = 0, the best possible rate of convergenee for the upper bound H{n) is n when 
either the one-sided finite differenee (f5) or the symmetrie finite differenee (33) is used, 


6. Generalization and applications. In Sections 4-5, all the results are obtained 
for one dimensional random variables only. In this section, we extend the results to a case 
of multivariates, which is not difficult but very tedious. Assume that J{6) = E^[L{X{9, (,))], 
where the multidimensional random variable X{9,f) = [Xi{9,f),X2{0,f),...,Xm{0,f)]'^ G 
j^m Pqj, Xi{9,f,) = Xi{9,f,i) G R, fi is uniform on [0,1). We only consider the case 

in which each Xi{6, is generated from using the inversion method. To avoid repetition, 
we list the result without proof which is very similar to that of Theorem 5. 

Assume that J{9) G R and 6 * G 0. For each i, 1 < i < m < 00 , let Fi{6,x) be the 
distribution function of with the decomposition that 

dFi{9,x) ^ f 0, if a; G = [bi,j{9),Cij{9)] 

dx \ fi,j{9,x), if a; G BR{9) = lcij{9),bij+i{9)), 

where [jj{B)j{9) U B(^j{9)} = R for all i, fij{9, a;) > 0 for any x G BR{9). It is possible that 
Fi{6,x) is discontinuous at bij{9). 

Theorem 10. Assume Assumptions (A1)-(A4) and, in addition. 


(Cl)’. L{X) is eontinuously differentiable in X, L{X) and L'x^{X) are bounded for alii; 


(C2)’. for each i. 




max 

e 


'dF,,,{e,x)X ,dF,^,{e,x 


do 


/- 


dx 


\ 


(C3)’. for all i,j, bij{9) is continuously differentiable in 9, and ma-XQ^bC^d))"^ < oo] 


(Cf)’. for all i,j, the functions Fi{9,Cij{9)) and Fi{9,b^j{9)) are continuously differentiable 
in 9, and max^ \F({9,Cij{9))\ < 00 , max^ \F({9,b~j{9))\ < 00 , 

Define Mi{9) = J2i,j{L{cij{9)) — L{bij{9))fi\Fl{9, Cij{9))\. Then Mi{9) > 0 is bounded. If 
Mi{9) > 0 for all 9, the best possible convergence rate for the KW algorithm (2) with hn 
defined by (24) is n~‘^C_ attained by choosing a„ = an~^, a > 2/{5Ki), and 

6n = 


Similar results can be obtained if other methods are used in the generation of random 
variables or if L{X) is a piecewise continuous function of X. The analysis can be applied to 
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general problems such as Monte Carlo optimization of queueing systems and other general 
systems. Although such a generalization is not trivial, the basic idea is the same except that 
the analysis becomes tedious and lengthy. Next we illustrate an application of Theorem 10 
to the optimization of queueing systems [see, e.g. Kleinrock (1976)]. 

Example 1 . GI/G/1 queue with single class of customers. In a GI/G/1 queue, 
there is one server (such as a teller in a bank) and one queue. Upon its arrival, a customer 
enters the server for service if the server is free, otherwise it joins the queue and waits for its 
turn. The service discipline is first-come-first-serve. The server cannot be free if there is at 
least one customer waiting in the queue. Assume that the distribution of interarrival times 
is Gait) and the distribution of service times is Gs{9,t) = p{9)Gl{t) + (1 — p{9))G‘l(t). For 
simplicity, we assume that Gait), Gl(t), and G^it) are independent of 9 and / t^dG{{t) < 
+ooG = 1,2, p{9) is continuously differentiable in 0, Ga(t),Gi(t),j = 1,2, are strictly 
increasing and continuously differentiable in t. In queueing theory, the system time of a 
customer is dehned as the time period from its arrival till departure. Let L{X{9,^)) be the 
average system time of the hrst N customers 

1 N 

i=l 

where Ti{9,^) is the system time of the Ah customer. Then J{9) = E[L{X{9,^))] is the 
mean system time of the first N customers. We want to hnd the optimal parameter 9* to 
minimize J{9). It is known that the analytical form of J{9) is not available for general Ga{t), 
Gl{t), and G^(f) [e.g. Kleinrock (1976)]. So we hnd 9* via the KW algorithm. Assume that 
the queue is initially empty. According to Lindley’s equation [e.g. Kleinrock (1976)]: 

(51) T,(0,O = max{T,_i(0,O-A,O} + A„ To{9,0 = 0, 

where Aj is the interarrival time between the {i — l)th and the Ah customer. Si is the service 
time of the Ah customer. The distributions of Aj and Si are respectively Gait) and Gs{9,t). 
We consider two scenarios. 

Case 1. We hnd 9* through computer simulation. We write a program to simulate 
the GI/G/1 queue. At the nth iteration, we perform two experiments with the same 
to obtain a hn that is dehned by (33). Consider that the inversion method is used in 
the generation of random variables Aj = G~^{ui), Si = G~^{9,Vi)G = 1,2,...,A^. Dehne 
the random factor as .^ = [wi, ^ 2 , •••, mtv, T 1 W 2 , •••, A(,^) = [Ai, A 2 ,..., Aat], S{9,^) = 
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[S'!, S' 2 ,S'at], and X{9,^) = [74(,^), S'(0, Since the function maxja;, 0} is continuous 
in X, L{X) is continuous in X. According to Theorem 10, we know that Mi{9) = 0 since 
both Ga{t) and Gs{9,t) are strictly increasing and continuously differentiable in t. Note 
that L{X) is not differentiable in X. However, L{X) is left and right differentiable with 
bounded one-sided derivatives. A simple modihcation of the proof of Corollary 3 shows 
that Var[hn\ = 0(1). Therefore, the convergence rate for the KW algorithm is jf 

the composition method is used in the generation of S{9,^) according to the distribution 
Gs{9,t), then from Theorems 7 and 8 (which is applicale to the case of multivariates) we 
know that the rate of convergence is 

Case 2. Assume that this is a real system and we want to perform on-line parameter 
adjustment. Let visualize n as the nth day of service. Suppose that the server serves more 
than N customers each day. At the nth day, the server serves customers with parameter 
value 9n and simultaneously collects information of X{9n,^n) which simply is a record of 
interarrival times {A”} and service times {(S'""}. At the end of the nth day, the server 
calculates 

v, = Gs{9n,S^),i = l,2,3,...,N. 

It is easy to verify that each Vi is uniform on [0,1). Then the server dehnes from the 
preceding Vi, i = 1,2 ,..., N, takes a > 0, and 


S{9n + 6n, u ..., Ar' = G;\9n + 6n, Gs{9n, 3^)), t = l,2, ..., N- 

^(^n - Cn) = [^ 1 "’', ^2 3^/], = G;\9,, - G,(0„, ^”)), t = l,2, ..., N. 

If Gs{9,t) = 1 — is exponential, then = {9n + Sn)3^/9n, *5'”’^ = (6*„ — Sn)3^/9n- 


With the values of A(,^„), 3{9n + Sn, Cn), 3{9n + Sn, Cn), from (40) and the form of L{X{9, ^)), 
the server computes L{X{9n -l- Sn,^n)) and L{X{9n — 5n,Cn)), which determines a With 
this hn, the server updates the parameter 9n+i according to the KW algorithm (2) for the 
next (n -|- l)th day. In such a way, we have formulated an on-line optimization problem that 
mimics the Monte Carlo optimization. Its convergence can be analyzed similarly to that 
of Case 1. Our purpose here is simply to point out that the results of this paper are not 
restricted to Monte Carlo optimization. 


7. Summary. So far, we have examined several variations of the KW algorithm and 
the MD algorithm under the symmetric finite difference, the one-sided hnite difference, and 
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the use of CRN when different methods are used in the generation of random variables. The 
results of this paper, together with previous results on the KW algorithm without the use 
of CRN [c.f. Fabian (1971); Kushner and Clark (1978)], provide a complete view toward the 
rates of convergence for the KW algorithm. For the ease of comparison, we summarize all 
the results in the following table. 


Table I. Rates of convergence for the KW/MD algorithm 



with CRN 
h{33) 

with CRN 
h(45) 

without CRN 
fc(3) 

without CRN 
h{23) 

inversion: 
Mi(0) ^ 0 

n-2/5 



^- 1/4 

inversion: 
Mi(0) = 0 

n-i/2 

^- 1/2 

^- 1/3 

^- 1/4 

rejection: 

general 

n-2/5 

^- 1/3 

^- 1/3 

^- 1/4 

composition: 

general 

n-2/5 


^- 1/3 

^- 1/4 


In Table I, h{3), h{23), h{33), and h(45) refer to the hnite-difference approximation hn 
dehned by (3), (23), (33), and (45), respectively. The phrase “without CRN” refers to 
using independent samples in calculating the hnite difference {hn}, which excludes sampling 
schemes that may lead to correlations between the samples. In other words, “without CRN” 
simply means that and ^ 2 ,n are independent in (3) and (23). When the inversion method 
is used in the generation of random variables and when Mi{6) = 0, we assume that Corollaries 
3 and 4 are applicable. Results pertaining to the KW algorithm without the use of CRN 
can be found in, for example, Fabian (1971), and Kushner and Clark (1978). 

Generally speaking, the use of CRN is always helpful in accelerating the convergence of 
the KW algorithm or the MD algorithm. In some cases, such as when Mi{6) = 0 in Theorem 
5, CRN helps a lot. In some of other cases, CRN may help less much. When the inversion 
method is used and when Mi{6) = 0, the convergence rate can reach the best possible 
rate for the two types of stochastic approximation algorithms. The remark at the end of 
Subsection 3.2 shows that, as far as the convergence rate of the KW algorithm is concerned, 
the inversion method is superior to the rejection method. Note that inversion can also be used 
to generate random variables with distributions of the form J2iPi{h)Fi{6,x). A comparison 
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of Theorem 4 and Theorems 7 and 8 shows that inversion is also superior to composition. 
When the distribution function F{6,x) of X{d,^) is strictly increasing and continuous, a 
close examination of the inversion, rejection, and composition methods shows that X{d,^) 
is continuous in 6 if it is generated from inversion. However, X{d,^) is discontinuous in 6 
if it is generated from either rejection or composition. It is such a distinction of continuity 
that determines the rates of the convergence for the KW algorithm. 
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